library(tidyverse)
library(modelr)
library(plotly)
mpg <- ggplot2::mpg
mpg_manual <- mpg %>%
filter(trans %in% c('manual(m5)', 'manual(m6)'))
library(openintro)
data(mariokart)
mario_kart <- mariokart %>%
filter(total_pr < 100)
rm(mariokart)
colnames(mario_kart) <- c("ID", "duration", "nBids", "cond", "startPr", "shipPr", "totalPr", "shipSp", "sellerRate", "stockPhoto", "wheels", "title")
SAT <- read.csv('/home/cla/Documentos/Vitor/DataCamp/Statistician-with-R//Datasets/SAT.csv')
1. Adding a numerical explanatory variable
Thus far, we’ve only considered multiple regression models with one numeric explanatory variable. In this chapter, we will explore models that have at least two numeric explanatory variables.
2. Adding a second numeric explanatory variable
Mathematically, adding a second numeric explanatory variable is trivial—we just add another term to our model. In this example, we are modeling the birthweight of babies born in San Francisco as a function of their pregnancy’s length of gestation (in weeks) and the mother’s age (in years). Note that both gestation and age are numeric variables here, so this is not a parallel slopes model. The syntax for fitting this model in R is similarly trivial—we simply extend the formula to incorporate both the gestation and age variables, just as we did before. Note also that since age is not categorical, we don’t have to worry about using the factor() function, or converting a categorical variable into binary variables, like we had to with year and is_newer.
lm(bwt ~ gestation + age, data = babies)
Call:
lm(formula = bwt ~ gestation + age, data = babies)
Coefficients:
(Intercept) gestation age
-15.5226 0.4676 0.1657
3. No longer a 2D problem
Unfortunately, while the mathematical and syntactical formulations of multiple regression models are easy extensions of things we already know, a visual formulation of these models becomes much trickier. You might be tempted to visualize our model using a ggplot expression like this. But unfortunately, this will not work, because ggplot only handles 2D graphics, and thus there is no z aesthetic.
ggplot(data = babies, aes(x = gestation, y = age, z = bwt)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
4. Data space is 3D
Our data space is now three dimensional, because bwt, gestation, and age are all numeric variables that our model encapsulates. So we will need to get a little bit creative in order to create a meaningful visualization of our model.
data_space1 <- ggplot(data = babies, aes(x = gestation, y = age)) +
geom_point(aes(color = bwt))
data_space1
5. Tiling the plane
One way to visualize a 3D model is to tile the plane. That is, we will create a 2D plot that covers all combinations of our two explanatory variables, and we will use color to reflect the corresponding fitted values. Here, we have created that grid of values using the data_grid() function, and then fed those into our model using the augment() function with the newdata argument. These two steps create a data frame of all possible combinations of gestation and age values, along with the model prediction for each.
library(broom)
grid1 <- babies %>%
data_grid(
gestation = seq_range(gestation, by = 1),
age = seq_range(age, by = 1)
)
mod1 <- lm(bwt ~ gestation + age, data = babies)
bwt_hats <- augment(mod1, newdata = grid1)
6. Tiles in the data space
We then use the geom_tile() function to superimpose these values on our data space. Setting the alpha argument to 0 point 5 allows us to see both the actual observations (as points) and the model predictions (as a smooth surface). Note how the color of the tiles lighten as you get closer to the upper right corner. This reflects that the model predicts heavier babies for older mothers with longer gestational periods.
mode_space1 <- data_space1 +
geom_tile(data = bwt_hats, aes(fill = .fitted, alpha = 0.5)) +
scale_fill_continuous('bwt', limits = range(babies$bwt))
7. 3D visualization
A perhaps more natural way to visualize a multiple regression model is as a plane through a cloud of points in three dimensions. Here, we use the plotly package to illustrate our model for the birthweight of babies. Note how the plane cuts through the data points. The residual associated with each observation is the vertical distance between that point and the plane. Believe it or not, this plane is the one that uniquely minimizes the sum of of the squared residuals between itself and these data points. The plot_ly syntax is similar to that of ggplot. In this case, the add_markers() function draws the points, while the add_surface() function draws the plane. The x, y, and plane objects were created in code that we aren’t showing here due to its complexity.
plot_ly(data = babies, z = ~bwt, x = ~gestation, y = ~age, opacity = 0.6) %>%
add_markers(marker = list(size = 2)) %>%
add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE, cmin = 0, cmax = 1, surfacecolor = 'grey', colorscale = 'grey')
Ignoring 15 observationsminimal value for n is 3, returning requested palette with 3 different levels
Ignoring 15 observationsminimal value for n is 3, returning requested palette with 3 different levels
8. Let’s practice!
Now it’s time for you to build some multiple regression models.
In terms of the R code, fitting a multiple linear regression model is easy: simply add variables to the model formula you specify in the lm() command.
In a parallel slopes model, we had two explanatory variables: one was numeric and one was categorical. Here, we will allow both explanatory variables to be numeric.
Instructions
The dataset mario_kart is already loaded in your workspace.
# Fit the model using duration and startPr
lm(totalPr ~ duration + startPr, data = mario_kart)
Call:
lm(formula = totalPr ~ duration + startPr, data = mario_kart)
Coefficients:
(Intercept) duration startPr
51.030 -1.508 0.233
One method for visualizing a multiple linear regression model is to create a heatmap of the fitted values in the plane defined by the two explanatory variables. This heatmap will illustrate how the model output changes over different combinations of the explanatory variables.
This is a multistep process:
grid.augment() with the newdata argument to find the \(\h\t{y\)}’s corresponding to the values in grid.data_space plot by using the fill aesthetic and geom_tile().Instructions
The model object mod is already in your workspace.
augment() to create a data.frame that contains the values the model outputs for each row of grid.geom_tile to illustrate these predicted values over the data_space plot. Use the fill aesthetic and set alpha = 0.5.# add predictions to grid
price_hats <- augment(mod, newdata = grid)
# tile the plane
data_space +
geom_tile(data = price_hats, aes(fill = .fitted), alpha = 0.5)
An alternative way to visualize a multiple regression model with two numeric explanatory variables is as a plane in three dimensions. This is possible in R using the plotly package.
We have created three objects that you will need:
x: a vector of unique values of durationy: a vector of unique values of startPrplane: a matrix of the fitted values across all combinations of x and yx <- seq_range(mario_kart$duration, n = 70)
y <- seq_range(mario_kart$startPr, n = 70)
new_data <- data.frame(duration = x, startPr = y)
plane <- matrix(nrow = 70, ncol = 70)
for (i in seq_along(x)) {
plane[, i] <- predict(mod, data.frame(duration = x, startPr = y[i]))
}
Much like ggplot(), the plot_ly() function will allow you to create a plot object with variables mapped to x, y, and z aesthetics. The add_markers() function is similar to geom_point() in that it allows you to add points to your 3D plot.
Note that plot_ly uses the pipe (%>%) operator to chain commands together.
Instructions
plot_ly command to draw 3D scatterplot for totalPr as a function of duration and startPr by mapping the z variable to the response and the x and y variables to the explanatory variables. Duration should be on the x-axis and starting price should be on the y-axis.add_surface() to draw a plane through the cloud of points by setting z = ~plane.# draw the 3D scatterplot
p <- plot_ly(data = mario_kart, z = ~totalPr, x = ~duration, y = ~startPr, opacity = 0.6) %>%
add_markers(marker = list(size = 2))
# draw the plane
p %>%
add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
1. Conditional interpretation of coefficients
What do the coefficients in a multiple regression model mean?
2. Two slope coefficients
Consider the fitted coefficients for our model for the birthweight of babies shown above. Since both of our explanatory variables are numeric, the coefficients of both gestation and age represent slopes. But slopes of what? A line certainly can’t have two slopes, but as we saw previously, our model is not a line: it is a plane. And a plane can have two slopes. Note that in this case both slopes are positive, but have different magnitudes.
mod1
Call:
lm(formula = bwt ~ gestation + age, data = babies)
Coefficients:
(Intercept) gestation age
-15.5226 0.4676 0.1657
3. Tiled plane
When we depicted our model as a tiled surface in the plane, we noticed that the color of the tiles got lighter and bluer as we moved towards the upper right hand corner of the plot. This reflects the fact that our model predicts higher birthweights for babies with longer gestational periods born to older mothers. But at what rate does this color change?
mode_space1
4. Tiled plane plus first slope
For a 30-year-old mother, the increase in expected birthweight is 0 point 47 ounces per day, and this can be viewed on the plot as the rate of color change as you move horizontally through the plot along a straight line. It is in this sense that the coefficient of gestation is a slope. However, this rate of change is constant across mothers of all ages. It doesn’t matter whether you are 20, 30, or 40. While the expected birthweight does depend on the mother’s age, the rate of change in birthweight with respect to gestational length does not depend on the the mother’s age. Thus, the coefficient of 0.47 ounces per day on gestation reflects the slope of the plane for a fixed value of age. And since this slope doesn’t change with respect to age, we often say that it reflects the effect of gestational length upon birthweight, while holding age constant.
mode_space1 + geom_hline(yintercept = 30, color = 'red')
5. Tiled plane plus second slope
Similarly, at the typical gestational length of 40 weeks (or 280 days), the expected birthweight increases at a rate of 0 point 17 ounces per year of the mother’s age. That is, our model predicts that the birthweight of a 36-year-old mother is about one-sixth of an ounce heavier than the birthweight of a 35-year-old mother, when both mothers carry full term. But again, that rate of change is the same irrespective of the gestational length, so we would expect the same difference for babies born at 35 weeks, 40 weeks, and 42 weeks. So the coefficient of 0 point 17 ounces per year is the slope of the plane for a fixed length of gestation. Again, since this slope is the same for all gestational lengths, we can interpret this coefficient as the effect of the mother’s age on birthweight, while holding gestational length constant. It’s also apparent from the plot that the colors change more rapidly as you move horizontally as opposed to vertically. This reflects the fact that the coefficient on gestation is bigger than the coefficient on age.
mode_space1 + geom_vline(xintercept = 280, color = 'red')
6. Coefficient interpretation
Now, you might be tempted to think that bigger coefficients are always more important, but this is not true. The value of the coefficients depend on the units of the explanatory variables. In this case, one variable is in the units of days (of gestational length), while the other is in the units of years (of mother’s age). They are not directly comparable. When you interpret coefficients from a multiple regression model, be sure to always include a phrase to the effect of “holding x constant.” Another good alternative is “after controlling for x.” This information is crucial to having a valid understanding of a regression model.
7. Let’s practice!
You’ll demonstate your understanding of these concepts in the next exercises.
1. Adding a third (categorical) variable
2. How could we forget about smoking?
No study of birthweight is complete without some discussion of the effect of smoking, which is known to have all kinds of undesireable health consequences. Mothers in the babies data set have their smoking status recorded as a binary variable: either she was or was not a smoker. Adding this third explanatory variable to our model is again easy. Since this variable is encoded as 0 or 1, no transformation is necessary and we can simply add another term to the mathematical model and the formula syntax to specify our new model.
3. Geometry
But here again the geometric changes are not as easy to see. However, our emphasis on the geometry of these models should give you some intuition. Recall that the addition of a categorical explanatory variable to a numeric explanatory variable changed a line into parallel lines. Moreover, a model with two numeric explanatory variables was a plane. Can you guess how the addition of a categorical explanatory variable will change the geometry of a model with two numeric explanatory variables? If you guess parallel planes, you’re right!
4. Drawing parallel planes in 3D
Once again, we can use plotly to create a 3D image of our model in the data space. Here we have used the add_surface() function twice: once to add a plane for non-smokers, and again to add another plane for smokers. The plane on “top” is the one for non-smokers—can you think of why this might be case? We’ll return to this question in a minute.
plot_ly(data = babies, z = ~bwt, x = ~gestation, y = ~age, opacity = 0.6) %>%
add_markers(color = ~factor(smoke), marker = list(size = 2)) %>%
add_surface(x = ~x, y = ~y, z = ~plane0, showscale = FALSE, cmin = 0, cmax = 1, surfacecolor = 'red', colorscale = 'red') %>%
add_surface(x = ~x, y = ~y, z = ~plane1, showscale = FALSE, cmin = 0, cmax = 1, surfacecolor = 'blue', colorscale = 'blue')
Ignoring 15 observationsSome values were outside the color scale and will be treated as NAminimal value for n is 3, returning requested palette with 3 different levels
Ignoring 15 observationsSome values were outside the color scale and will be treated as NAminimal value for n is 3, returning requested palette with 3 different levels